DETAILED ACTION

Response to Amendment 

1. In response to the Advisory Action mailed 1/12/09, applicant has submitted an
amendment and Request for Continued Examination filed 1/27/09.
Claims 1, 8, 15, 20, and 25 have been amended.



Response to Arguments 

1. Applicant's arguments with respect to claims 1, 8, 15, 20, and 25 have been
considered but are moot in view of the new ground(s) of rejection.



Claim Rejections - 35 USC § 103 

2. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 



1. Claims 1-5, 8-10, and 12-27 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Katsumi Patent No.: US 6,369,846 ("KATSUMI") in view of Nefian
Pub. No.: US 2003/0212557 ("NEFIAN") and Lubiarz et al. (US 7,003,452), hereafter
Lubiarz, and Veltman (US 5,481,543).
2. Regarding claim 1, KATSUMI teaches a method, comprising:

electronically capturing visual features associated with a speaker speaking 
("speaking attendee determination information based on video signal", KATSUMI, 
column 6, lines 52-53); 

electronically capturing audio and wherein the visual features are separated from
the audio and processed separately, and separated based on time when each was 
captured ("speaking attendee determination information based on audio signal", 
KATSUMI, column 6, lines 50-51; col. 6, lines 57-67; col. 4, lines 34-46; Figure 3; See 
Response to Arguments); 

matching selective portions of the audio with the visual features ("when the 
'speaking attendee determination information based on audio signal' represents a voice 
and the 'speaking attendee determination information based on video signal' represents 
a change of the shape of the lip portion simultaneously", KATSUMI, column 6, lines 58- 
63); and 

identifying the remaining and unmatched portions of the audio as potential noise 
not associated with the speaker speaking ("if an audio signal contains a noise such as a 
page turning noise and voices of other people along with a voice of a conference 
attendee, since an image of the motion of the lip portion of a conference attendee can 
be detected from a video signal, the speaking attendee can be determined", KATSUMI, 
column 7, lines 63-67). 
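
For illustration only, the matching and noise-identification steps recited above can be sketched as follows. The sketch is not taken from KATSUMI or any other cited reference; the function name and the per-time-slice boolean inputs are hypothetical assumptions.

```python
def split_speech_and_noise(audio_active, mouth_moving):
    """Classify audio time slices against visual features.

    audio_active[i] is True when sound was captured in slice i.
    mouth_moving[i] is True when lip movement was seen in slice i.
    Both inputs are hypothetical per-slice flags, assumed to be
    already aligned on a common time line.
    """
    if len(audio_active) != len(mouth_moving):
        raise ValueError("inputs must cover the same time slices")

    matched, unmatched = [], []
    for index, (has_audio, moving) in enumerate(zip(audio_active, mouth_moving)):
        if not has_audio:
            continue  # nothing captured in this slice
        if moving:
            matched.append(index)    # audio matched to the speaker
        else:
            unmatched.append(index)  # remaining audio: potential noise
    return matched, unmatched
```

Under these assumptions, audio slices that coincide with lip movement are attributed to the speaker, and the remaining audio slices are flagged as potential noise.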
However, KATSUMI does not disclose that the method occurs during a training
session; that the visual features include a face recognition of the speaker and a mouth 
recognition within pixels associated with the face that detects when the mouth is moving 
and when the mouth is not moving by differences in the pixels from frame to frame in 
the captured visual features during the training session; and detecting frequencies in the 
captured audio during same time slices of the training session for the captured visual 
features when the mouth is detected as moving and when the mouth is detected as not 
moving. 

In the same field of audiovisual processing, NEFIAN teaches: 

a training session ("training network and speech recognition module 18", 
NEFIAN, paragraph [0012]); 

visual features including a face recognition of the speaker and a mouth 
recognition within pixels associated with the face that detects when the mouth is moving 
and when the mouth is not moving by differences in the pixels from frame to frame in 
the captured visual features during the training session (see NEFIAN, paragraphs 
[0018]-[0020], a series of vector calculations are performed on the pixels representing 
the mouth regions); and 

detecting frequencies in the captured audio ("13 MFCC coefficients extracted 
from a window of 20 ms", NEFIAN, paragraph [0055]) during same time slices of the 
training session for the captured visual features when the mouth is detected as moving 
and when the mouth is detected as not moving ("discrete nodes at time t for each HMM 
are conditioned by the discrete nodes at time t-1 of all the related HMMs", NEFIAN,
paragraph [0023]). 
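
For illustration only, detecting mouth movement "by differences in the pixels from frame to frame" can be sketched as simple frame differencing over a mouth region. This is a swapped-in simplification and is not the vector calculation of NEFIAN paragraphs [0018]-[0020]; the function name, the region tuple, and the threshold are hypothetical assumptions.

```python
def mouth_motion_flags(frames, region, threshold):
    """Flag mouth movement between consecutive grayscale frames.

    frames: list of 2-D lists of pixel intensities (assumed input).
    region: (top, left, bottom, right) bounds of the mouth pixels,
            assumed to come from an earlier face/mouth detection step.
    threshold: mean absolute pixel change above which the mouth is
            treated as moving (a hypothetical tuning value).
    Returns one flag per consecutive pair of frames.
    """
    top, left, bottom, right = region
    pixel_count = (bottom - top) * (right - left)
    if pixel_count <= 0:
        raise ValueError("mouth region must contain at least one pixel")

    flags = []
    for previous, current in zip(frames, frames[1:]):
        total_change = 0
        for row in range(top, bottom):
            for col in range(left, right):
                total_change += abs(current[row][col] - previous[row][col])
        flags.append(total_change / pixel_count > threshold)
    return flags
```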

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 

Katsumi, in view of Nefian, fails to teach where the detecting frequencies is
detecting bands of frequencies in the captured audio during time slices, and where the 
detecting bands of frequencies is to determine when the speaker is speaking and when 
the speaker is not speaking during the training session. 

Lubiarz teaches where the detecting frequencies is detecting bands of 
frequencies in the captured audio during time slices, and where the detecting bands of 
frequencies is to determine when the speaker is speaking and when the speaker is not 
speaking during the training session ("speech recognition", col. 2, lines 28-37; "detects 
silence... detects the presence of voice activity... optimized... for each of the frequency 
bands", col. 5, lines 17-34). 
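
For illustration only, detecting bands of frequencies in time slices to decide when a speaker is or is not speaking can be sketched as a band-energy threshold. The sketch is not the method of Lubiarz; the function names, the 300-3400 Hz default band, and the threshold value are hypothetical assumptions, and a plain discrete Fourier transform is used for simplicity.

```python
import math


def band_energy(samples, sample_rate, low_hz, high_hz):
    """Sum normalized DFT power of the bins that fall inside the band."""
    n = len(samples)
    if n == 0:
        return 0.0
    energy = 0.0
    for k in range(1, n // 2 + 1):
        frequency = k * sample_rate / n
        if low_hz <= frequency <= high_hz:
            real = sum(s * math.cos(2 * math.pi * k * i / n)
                       for i, s in enumerate(samples))
            imag = -sum(s * math.sin(2 * math.pi * k * i / n)
                        for i, s in enumerate(samples))
            energy += (real * real + imag * imag) / (n * n)
    return energy


def speech_flags(slices, sample_rate, low_hz=300.0, high_hz=3400.0,
                 threshold=0.01):
    """Return True for each time slice whose in-band energy exceeds
    the (assumed) threshold, and False otherwise."""
    return [band_energy(s, sample_rate, low_hz, high_hz) > threshold
            for s in slices]
```

Under these assumptions, a slice containing a tone inside the band is flagged as speech, while silence and out-of-band hum are not.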

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian, to include the teaching of Lubiarz 
of where the detecting frequencies is detecting bands of frequencies in the captured 
audio during time slices, and where the detecting bands of frequencies is to determine 
when the speaker is speaking and when the speaker is not speaking during the training 



Application/Control Number: 10/813,642 Page 6 

Art Unit: 2626 

session, in order to ensure that the most effective processing for a particular type of 
signal is applied to that type of signal, as described by Lubiarz (col. 1, lines 5-10). 

Katsumi, in view of Nefian and Lubiarz, fails to teach wherein the visual features
and audio are initially captured at a different rate from one another. 

Veltman suggests wherein the visual features and audio are initially captured at a 
different rate from one another and the separation is based on time stamps ("sampling 
rate clock of the audio signal and the frame rate clock of the video system operate 
independently... time stamp", col. 1, lines 42-54; "sampling rate control circuit... for use 
in decoding the video stream and the audio stream respectively... video decoder 
removes each access unit... sampling rate controller... controlled by... time stamps", 
col. 5, lines 30-61; where independent clocks suggest different rates and the decoding
is a separating of demultiplexed data based on time stamps).
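
For illustration only, associating audio and video captured at different rates by way of time stamps can be sketched as a lookup of the most recent video frame for each audio slice. The sketch is not the decoder of Veltman; the function name and the millisecond time stamps are hypothetical assumptions.

```python
import bisect


def frame_index_for_audio(audio_timestamps_ms, frame_timestamps_ms):
    """Map each audio slice to the latest video frame at or before it.

    Both inputs are assumed to be time stamps in milliseconds on a
    shared clock, with frame_timestamps_ms sorted ascending.
    Returns None for audio that precedes the first video frame.
    """
    indices = []
    for stamp in audio_timestamps_ms:
        position = bisect.bisect_right(frame_timestamps_ms, stamp) - 1
        indices.append(position if position >= 0 else None)
    return indices
```

For example, with video frames every 40 ms and audio slices every 20 ms, two consecutive audio slices map to each video frame.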

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian and Lubiarz, to include the 
teaching of Veltman wherein the visual features and audio are initially captured at a 
different rate from one another and the separation is based on time stamps in order to 
ensure proper decoding of encoded data, as described by Veltman (col. 1, lines 55-60).



3. Regarding claim 2, KATSUMI further teaches:

electronically capturing additional visual features associated with a different
speaker speaking ("images of terminals determined as having speaking attendees", 
KATSUMI, column 7, lines 56-57); and 

matching some of the remaining portions of the audio from the potential noise 
with the additional speaker speaking ("if an audio signal contains a noise such as a 
page turning noise and voices of other people along with a voice of a conference 
attendee, since an image of the motion of the lip portion of a conference attendee can 
be detected from a video signal, the speaking attendee can be determined", KATSUMI, 
column 7, lines 63-67). 

4. Regarding claim 3, NEFIAN further teaches generating parameters associated 
with the matching and the identifying ("audio processing and visual feature extraction", 
NEFIAN, paragraph [0012]) and providing the parameters to a Bayesian Network which 
models the speaker speaking ("video data must be fused with audio data using... a 
coupled hidden Markov model [HMM]", NEFIAN, paragraph [0023], where the HMM is a 
dynamic Bayesian network). 

5. Regarding claim 4, NEFIAN further teaches that electronically capturing the 
visual features further includes processing a neural network ("neural network", NEFIAN, 
paragraph [0014]) against electronic video associated with the speaker speaking 
("speaker's face in a video sequence", NEFIAN, paragraph [0014]), wherein the neural 
network is trained to detect and monitor the face of the speaker ("face detection",
NEFIAN, paragraph [0014]). 

6. Regarding claim 5, NEFIAN further teaches filtering the detected face of the 
speaker to detect movement or lack of movement in the mouth of the speaker ("after the 
face is detected, mouth region discrimination is usual", NEFIAN, paragraph [0015]). 

7. Regarding claim 8, KATSUMI teaches a method, comprising: 

monitoring an electronic video of a first speaker and a second speaker ("images 
of terminals determined as having speaking attendees", KATSUMI, column 7, lines 56- 
57); 

concurrently capturing audio associated with the first and second speaker 
speaking ("voices may be contained in the audio signal", KATSUMI, column 3, line 1); 

analyzing the video to detect when the first and second speakers are moving 
their respective mouths ("extracts the change amount of the shape of the lip portion", 
KATSUMI, column 6, lines 24-25); 

the audio and video are separated and then compared based on time ("speaking 
attendee determination information based on audio signal", KATSUMI, column 6, lines 
50-51; col. 6, lines 57-67; col. 4, lines 34-46; Figure 3; See Response to Arguments); 
and 

matching portions of the captured audio to the first speaker and other portions to 
the second speaker based on the analysis and wherein at least some points in the 
training session indicate that the first and second speakers are simultaneously speaking
("if an audio signal contains a noise such as a page turning noise and voices of other 
people along with a voice of a conference attendee, since an image of the motion of the 
lip portion of a conference attendee can be detected from a video signal, the speaking 
attendee can be determined", KATSUMI, column 7, lines 63-67). 

However, KATSUMI does not disclose a training session for face recognition of 
the first and second speakers; indications as to when mouths for the first and second 
speakers are moving and not moving from frame to frame of the video during the 
training session; audio separated from video and matched back to a corresponding 
portion of the video via a particular time slice associated with both the audio and the 
video; detecting differences in pixels within the faces occurring from frame to frame of 
the video for each of the speakers; and detecting changes in frequencies within the 
audio for a same time slice that indicates a particular mouth of one of the speakers is 
moving and by noting a particular frequency for a particular one of the speakers, and 
discerning what each is saying based on their respective frequencies that were noted. 

In the same field of audiovisual processing, NEFIAN teaches: 

a training session ("training network and speech recognition module 18", 
NEFIAN, paragraph [0012]); 

indications as to when mouths for the first and second speakers are moving and 
not moving from frame to frame of the video during the training session (see NEFIAN, 
paragraphs [0018]-[0020], a series of vector calculations are performed on the pixels 
representing the mouth regions); 
audio separated from video ("audiovisual data is separately subjected to audio
processing and visual feature extraction 14", NEFIAN, paragraph [0012]) and matched 
back to a corresponding portion of the video ("video data must be fused with audio 
data", NEFIAN, paragraph [0023]) via a particular time slice associated with both the 
audio and the video ("discrete nodes at time t for each HMM are conditioned by the
discrete nodes at time t-1 of all the related HMMs", NEFIAN, paragraph [0023]);

detecting differences in pixels within the faces occurring from frame to frame of 
the video for each of the speakers (see NEFIAN, paragraphs [0018]-[0020], a series of 
vector calculations are performed on the pixels representing the mouth regions); and 

detecting changes in frequencies within the audio ("13 MFCC coefficients 
extracted from a window of 20 ms", NEFIAN, paragraph [0055]) for a same time slice 
that indicates a particular mouth of one of the speakers is moving ("discrete nodes at 
time t for each HMM are conditioned by the discrete nodes at time t-1 of all the related
HMMs", NEFIAN, paragraph [0023]) and by noting a particular frequency for a particular 
one of the speakers ("13 MFCC coefficients extracted from a window of 20 ms", 
NEFIAN, paragraph [0055]), and discerning what each is saying based on their 
respective frequencies that were noted ("for audio-only speech recognition", NEFIAN, 
paragraph [0055]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 
Katsumi, in view of Nefian, fails to teach where the detecting frequencies is
detecting bands of frequencies in the captured audio during time slices, where the 
detecting bands of frequencies is to determine when the speaker is speaking and when 
the speaker is not speaking during the training session, and where a particular band of 
frequency is noted to determine when the speaker is speaking and when the speaker is 
not speaking during the training session. 

Lubiarz teaches where the detecting frequencies is detecting bands of 
frequencies in the captured audio during time slices, where the detecting bands of 
frequencies is to determine when the speaker is speaking and when the speaker is not 
speaking during the training session, and where a particular band of frequency is noted 
to determine when the speaker is speaking and when the speaker is not speaking 
during the training session ("speech recognition", col. 2, lines 28-37; "detects silence... 
detects the presence of voice activity... optimized... for each of the frequency bands", 
col. 5, lines 17-34; exceeding a threshold notes a frequency band). 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian, to include the teaching of Lubiarz 
of where the detecting frequencies is detecting bands of frequencies in the captured 
audio during time slices, where the detecting bands of frequencies is to determine when 
the speaker is speaking and when the speaker is not speaking during the training 
session, and where a particular band of frequency is noted to determine when the 
speaker is speaking and when the speaker is not speaking during the training session, 
in order to ensure that the most effective processing for a particular type of signal is
applied to that type of signal, as described by Lubiarz (col. 1, lines 5-10).

Katsumi, in view of Nefian and Lubiarz, fails to teach wherein the visual features
and audio are initially captured at a different rate from one another and the 
separation/comparison is based on time stamps. 

Veltman suggests wherein the visual features and audio are initially captured at a 
different rate from one another and the separation/comparison is based on time stamps 
("sampling rate clock of the audio signal and the frame rate clock of the video system 
operate independently... time stamp", col. 1, lines 42-54; "sampling rate control circuit... 
for use in decoding the video stream and the audio stream respectively... video decoder 
removes each access unit... sampling rate controller... controlled by... time stamps", 
col. 5, lines 30-61; where independent clocks suggest different rates and the decoding
is a separating of demultiplexed data based on time stamps).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian and Lubiarz, to include the 
teaching of Veltman of wherein the visual features and audio are initially captured at a 
different rate from one another and the separation/comparison is based on time stamps 
in order to ensure proper decoding of encoded data, as described by Veltman (col. 1,
lines 55-60).

8. Regarding claim 9, NEFIAN further teaches modeling the analysis for
subsequent interactions with the first and second speakers ("the result is a model for the 
underlying process", NEFIAN, paragraph [0033]). 

9. Regarding claim 10, NEFIAN further teaches that analyzing further includes 
processing a neural network ("neural network", NEFIAN, paragraph [0014]) for detecting 
the faces of the first and second speakers ("speaker's face in a video sequence", 
NEFIAN, paragraph [0014]) and processing vector classifying algorithms to detect when 
the first and second speakers' respective mouths are moving or not moving (see 
NEFIAN, paragraphs [0018]-[0020], a series of vector calculations is performed on the 
mouth regions). 

10. Regarding claim 13, KATSUMI further teaches identifying selective portions of 
the captured audio as noise if the selective portions have not been matched to the first 
speaker or the second speaker ("if an audio signal contains a noise such as a page 
turning noise and voices of other people along with a voice of a conference attendee, 
since an image of the motion of the lip portion of a conference attendee can be detected 
from a video signal, the speaking attendee can be determined", KATSUMI, column 7, 
lines 63-67). 

11. Regarding claim 14, NEFIAN further teaches that matching further includes
identifying time dependencies associated with when selective portions of the electronic 
video were monitored and when selective portions of the audio were captured ("discrete
nodes at time t for each HMM are conditioned by the discrete nodes at time t-1 of all the
related HMMs", NEFIAN, paragraph [0023]). 

12. Regarding claim 15, KATSUMI teaches a system, comprising: 

a camera (see KATSUMI, column 4, lines 22-23, the conference terminals 
produce video signals, therefore a camera is inherent); 

a microphone (see KATSUMI, column 4, lines 22-23, the conference terminals 
produce audio signals, therefore a microphone is inherent); and 

a processing device ("MCU", KATSUMI, column 4, line 21), wherein the camera 
captures video of a speaker and communicates the video to the processing device, the 
microphone captures audio associated with the speaker and an environment of the 
speaker and communicates the audio to the processing device ("the conference 
terminals 6a to 6c multiplex video signals and audio signals of locations [A] to [C]... and 
transmit the transmission signals to the MCU", KATSUMI, column 4, lines 22-26) and 
the video and audio separated from one another ("speaking attendee determination 
information based on audio signal", KATSUMI, column 6, lines 50-51; col. 6, lines 57- 
67; col. 4, lines 34-46; Figure 3; See Response to Arguments); the processing device
includes instructions that identifies visual features of the video where the speaker is 
speaking ("speaking attendee determination information based on video signal", 
KATSUMI, column 6, lines 52-53) and uses time dependencies to match portions of the 
audio to those visual features ("when the 'speaking attendee determination information 
based on audio signal' represents a voice and the 'speaking attendee determination
information based on video signal' represents a change of the shape of the lip portion 
simultaneously", KATSUMI, column 6, lines 58-63). 

However, KATSUMI does not disclose a plurality of frames within a period of time 
designated as a training session and each frame associated with a particular time slice; 
audio associated with the particular time slice of the training session; and a processing 
device that recognizes a face of the speaker in each frame of the video and a mouth 
within the face and detects when the mouth is moving or not moving from frame to 
frame of the video by changes in pixels associated with the mouth, and wherein when 
the mouth is moving a detected change in frequency within the same time slice of the 
audio identifies the speaker as speaking and a particular frequency that uniquely 
identifies the speaker when the speaker is speaking. 

In the same field of audiovisual processing, NEFIAN teaches: 
a plurality of frames ("digital form including but not limited to MPEG-2", NEFIAN, 
paragraph [0011]) within a period of time designated as a training session ("training
network and speech recognition module 18", NEFIAN, paragraph [0012]) and each 
frame associated with a particular time slice (MPEG frames are inherently associated 
with a time slice); 

audio associated with the particular time slice of the training session ("video data 
must be fused with audio data", NEFIAN, paragraph [0023]); and 

a processing device that recognizes a face of the speaker in each frame of the 
video ("speaker's face in a video sequence", NEFIAN, paragraph [0014]) and a mouth 
within the face and detects when the mouth is moving or not moving from frame to
frame of the video by changes in pixels associated with the mouth (see NEFIAN, 
paragraphs [0018]-[0020], a series of vector calculations are performed on the pixels 
representing the mouth regions), and wherein when the mouth is moving a detected 
frequency within the same time slice of the audio identifies the speaker as speaking
("discrete nodes at time t for each HMM are conditioned by the discrete nodes at time t-1
of all the related HMMs", NEFIAN, paragraph [0023]; "13 MFCC coefficients extracted 
from a window of 20 ms", NEFIAN, paragraph [0055]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 

Katsumi, in view of Nefian, fails to teach where the detecting frequencies is
detecting bands of frequencies, determining that the speaker is speaking, and a 
particular band of frequency that uniquely identifies the speaker when the speaker is not 
speaking. 

Lubiarz suggests where the detecting frequencies is detecting bands of 
frequencies, determining that the speaker is speaking, and a particular band of 
frequency that uniquely identifies the speaker when the speaker is not speaking 
("speech recognition", col. 2, lines 28-37; "detects silence... detects the presence of 
voice activity... optimized... for each of the frequency bands", col. 5, lines 17-34; 
speech occurs over a relatively defined range of frequencies [i.e., band], and so this 
range is a unique band that identifies whether the speaker is speaking or not speaking.
Also, since Lubiarz teaches processing only when speech is present, Lubiarz suggests 
where there is an indicator [e.g., the voice activity decision] that is output at a particular 
time to tell the system that it is time to perform the speech recognition with face 
movement recognition taught by Katsumi and Nefian). 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian, to include the teaching of Lubiarz 
of where the detecting frequencies is detecting bands of frequencies, determining that 
the speaker is speaking, and a particular band of frequency that uniquely identifies the 
speaker when the speaker is not speaking, in order to ensure that the most effective 
processing for a particular type of signal is applied to that type of signal, as described by 
Lubiarz (col. 1, lines 5-10). 

Katsumi, in view of Nefian and Lubiarz, fails to teach wherein the visual features
and audio are initially captured at a different rate from one another and the association 
is based on time stamps. 

Veltman suggests wherein the visual features and audio are initially captured at a 
different rate from one another and the association is based on time stamps ("sampling
rate clock of the audio signal and the frame rate clock of the video system operate
independently... time stamp", col. 1, lines 42-54; "sampling rate control circuit... for use
in decoding the video stream and the audio stream respectively... video decoder
removes each access unit... sampling rate controller... controlled by... time stamps",
col. 5, lines 30-61; where independent clocks suggest different rates and the decoding
is a separating of demultiplexed data based on time stamps).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian and Lubiarz, to include the 
teaching of Veltman of wherein the visual features and audio are initially captured at a 
different rate from one another and the association is based on time stamps in order to 
ensure proper decoding of encoded data, as described by Veltman (col. 1, lines 55-60).

13. Regarding claim 16, KATSUMI further teaches that the captured video also 
includes images of a second speaker ("images of terminals determined as having 
speaking attendees", KATSUMI, column 7, lines 56-57) and the audio includes sounds 
associated with the second speaker ("voices may be contained in the audio signal", 
KATSUMI, column 3, line 1), and wherein the instructions match some portions of the
audio to the second speaker when some of the visual features indicate the second 
speaker is speaking ("if an audio signal contains a noise such as a page turning noise 
and voices of other people along with a voice of a conference attendee, since an image 
of the motion of the lip portion of a conference attendee can be detected from a video 
signal, the speaking attendee can be determined", KATSUMI, column 7, lines 63-67). 




Regarding claim 17, NEFIAN further teaches instructions that interact with a 
neural network ("neural network", NEFIAN, paragraph [0014]) to detect the face of the 
speaker from the captured video ("speaker's face in a video sequence", NEFIAN, 
paragraph [0014]). 

14. Regarding claim 18, NEFIAN further teaches that the instructions interact with a 
pixel vector algorithm to detect when the mouth associated with the face moves or does 
not move within the captured video (see NEFIAN, paragraphs [0018]-[0020], a series of 
vector calculations are performed on the pixels representing the mouth regions). 

15. Regarding claim 19, NEFIAN further teaches that the instructions generate
parameter data ("audio processing and visual feature extraction", NEFIAN, paragraph 
[0012]) that configures a Bayesian network ("video data must be fused with audio data 
using... a coupled hidden Markov model [HMM]", NEFIAN, paragraph [0023], where the 
HMM is a dynamic Bayesian network) which models subsequent interactions with the 
speaker ("the result is a model for the underlying process", NEFIAN, paragraph [0033]) 
to determine when the speaker is speaking and to determine appropriate audio to 
associate with the speaker speaking in the subsequent interactions ("speech 
recognition", NEFIAN, paragraph [0023]). 

16. Regarding claim 20, KATSUMI teaches a machine accessible medium having 
associated instructions, which when accessed, results in a machine performing: 

separating audio and video associated with a speaker speaking into separate
frames for analysis (see KATSUMI, FIG. 3, the audio processing is separate from the 
video processing); 

identifying visual features from the video that indicate a mouth of the speaker is 
moving or not moving ("extracts the change amount of the shape of the lip portion", 
KATSUMI, column 6, lines 24-25); and 

associating portions of the audio with selective ones of the visual features that 
indicate the mouth is moving ("when the 'speaking attendee determination information 
based on audio signal' represents a voice and the 'speaking attendee determination 
information based on video signal' represents a change of the shape of the lip portion 
simultaneously", KATSUMI, column 6, lines 58-63). 

However, KATSUMI does not disclose: a training session, wherein each frame 
associated with a same time line to permit specific frames of the video to be matched to 
specific frequencies of the audio during a same time slice occurring along the time line 
for the training session; identifying a face of the speaker and then identifying pixels 
within the face that represents the mouth and then noting changes in the pixels from 
frame to frame of the video along the time line; and matching frequencies of the audio 
with detected movements of the mouth during a same time period within the time line 
and associating a frequency with the speaker when the mouth is moving. 

In the same field of audiovisual processing, NEFIAN teaches: 

a training session ("training network and speech recognition module 18", 
NEFIAN, paragraph [0012]), wherein each frame associated with a same time line to
permit specific frames of the video to be matched to specific frequencies of the audio
during a same time slice occurring along the time line for the training session ("discrete
nodes at time t for each HMM are conditioned by the discrete nodes at time t-1 of all the
related HMMs", NEFIAN, paragraph [0023]); 

identifying a face of the speaker ("speaker's face in a video sequence", NEFIAN, 
paragraph [0014]) and then identifying pixels within the face that represents the mouth 
and then noting changes in the pixels from frame to frame of the video along the time 
line (see NEFIAN, paragraphs [0018]-[0020], a series of vector calculations are
performed on the pixels representing the mouth regions); and 

matching frequencies of the audio with detected movements of the mouth during 
a same time period within the time line ("video data must be fused with audio data", 
NEFIAN, paragraph [0023]) and associating a particular frequency with the speaker 
when the mouth is moving ("13 MFCC coefficients extracted from a window of 20 ms", 
NEFIAN, paragraph [0055]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 

Katsumi, in view of Nefian, fails to teach where the detecting frequencies is 
detecting bands of frequencies in the captured audio during time slices, where the 
detecting bands of frequencies is to determine when the speaker is speaking and when 
the speaker is not speaking during the training session, and where the frequencies are 
bands of frequencies. 

Lubiarz teaches where the detecting frequencies is detecting bands of 
frequencies in the captured audio during time slices, where the detecting bands of 
frequencies is to determine when the speaker is speaking and when the speaker is not 
speaking during the training session, and where the frequencies are bands of 
frequencies ("speech recognition", col. 2, lines 28-37; "detects silence... detects the 
presence of voice activity... optimized... for each of the frequency bands", col. 5, lines 
17-34; exceeding a threshold indicates speech in a frequency band, and performing the 
appropriate processing as per Katsumi and Nefian includes matching the speech 
portion with the appropriate video). 
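
By way of non-limiting illustration only, the following minimal Python sketch shows one way that band-wise detection of speech within a single time slice could be carried out. It is not taken from Lubiarz or from any other cited reference; the function names, the band edges, and the per-band thresholds are assumptions chosen for the example.

```python
import math


def band_energies(samples, sample_rate, bands):
    """Return the spectral energy of one time slice in each (low_hz, high_hz) band.

    Uses a naive discrete Fourier transform so the sketch needs no third-party
    library; a practical system would likely use an FFT instead.
    """
    n = len(samples)
    energies = [0.0] * len(bands)
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        power = (re * re + im * im) / n
        for b, (low_hz, high_hz) in enumerate(bands):
            if low_hz <= freq < high_hz:
                energies[b] += power
    return energies


def is_speech(samples, sample_rate, bands, thresholds):
    """Flag the slice as speech when any band's energy exceeds its threshold."""
    energies = band_energies(samples, sample_rate, bands)
    return any(e > t for e, t in zip(energies, thresholds))


if __name__ == "__main__":
    rate = 8000                      # assumed sampling rate in Hz
    bands = [(300, 1000), (1000, 3400)]  # assumed band edges in Hz
    thresholds = [1.0, 1.0]          # assumed per-band thresholds
    tone = [math.sin(2 * math.pi * 500 * i / rate) for i in range(160)]  # 20 ms slice
    silence = [0.0] * 160
    print(is_speech(tone, rate, bands, thresholds))
    print(is_speech(silence, rate, bands, thresholds))
```

In this sketch each slice is classified independently, so the speaking and non-speaking periods follow from the sequence of per-slice results along the time line.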

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian, to include the teaching of Lubiarz 
of where the detecting frequencies is detecting bands of frequencies in the captured 
audio during time slices, where the detecting bands of frequencies is to determine when 
the speaker is speaking and when the speaker is not speaking during the training 
session, and where the frequencies are bands of frequencies, in order to ensure that 
the most effective processing for a particular type of signal is applied to that type of 
signal, as described by Lubiarz (col. 1, lines 5-10). 

Katsumi, in view of Nefian and Lubiarz, fails to teach the audio and video originally 
captured at a different rate from one another and time stamps occur along the time line 
for the training session. 



Application/Control Number: 10/813,642 
Art Unit: 2626 

Veltman suggests the audio and video originally captured at a different rate from 
one another and time stamps occur along the time line for the training session 
("sampling rate clock of the audio signal and the frame rate clock of the video system 
operate independently... time stamp", col. 1, lines 42-54; "sampling rate control circuit... 
for use in decoding the video stream and the audio stream respectively... video decoder 
removes each access unit... sampling rate controller... controlled by... time stamps", 
col. 5, lines 30-61; where independent clocks suggests different rates and the decoding 
is a separating of demultiplexed data based on time stamps). 
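
By way of non-limiting illustration only, the following minimal Python sketch shows how time stamps could be used to match video frames with audio slices that were captured at different rates. It is not taken from Veltman or from any other cited reference; the function name, the millisecond time base, and the example rates are assumptions chosen for the example.

```python
def match_by_timestamp(frame_times, slice_times, frame_period):
    """Map each video frame index to the audio slice indices that start
    within that frame's period.

    All times are integer milliseconds on a shared time line, and both
    input lists are assumed to be sorted in ascending order.
    """
    matches = {}
    j = 0
    for i, start in enumerate(frame_times):
        # Skip audio slices that begin before this frame starts.
        while j < len(slice_times) and slice_times[j] < start:
            j += 1
        # Collect audio slices that begin before the frame period ends.
        k = j
        group = []
        while k < len(slice_times) and slice_times[k] < start + frame_period:
            group.append(k)
            k += 1
        matches[i] = group
    return matches


if __name__ == "__main__":
    frames = [0, 40, 80]             # assumed 25 frames per second
    slices = list(range(0, 120, 10))  # assumed one audio slice every 10 ms
    print(match_by_timestamp(frames, slices, 40))
```

Because the matching relies only on the time stamps, the two streams need not share a sampling clock.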

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian and Lubiarz, to include the 
teaching of Veltman of the audio and video originally captured at a different rate from 
one another and time stamps occur along the time line for the training session in order 
to ensure proper decoding of an encoded data, as described by Veltman (col. 1, lines 
55-60). 

17. Regarding claim 21, KATSUMI further teaches including instructions for 
associating other portions of the audio with different ones of the visual features that 
indicate the mouth is not moving ("if an audio signal contains a noise such as a page 
turning noise and voices of other people along with a voice of a conference attendee, 
since an image of the motion of the lip portion of a conference attendee can be detected 
from a video signal, the speaking attendee can be determined", KATSUMI, column 7, 
lines 63-67). 

18. Regarding claim 22, KATSUMI further teaches instructions for: 

identifying second visual features from the video that indicate a different mouth of 
another speaker is moving or not moving ("images of terminals determined as having 
speaking attendees", KATSUMI, column 7, lines 56-57); and 

associating different portions of the audio with selective ones of the second 
visual features that indicate the different mouth is moving ("if an audio signal contains a 
noise such as a page turning noise and voices of other people along with a voice of a 
conference attendee, since an image of the motion of the lip portion of a conference 
attendee can be detected from a video signal, the speaking attendee can be 
determined", KATSUMI, column 7, lines 63-67). 

19. Regarding claim 23, NEFIAN further teaches instructions for: 

processing a neural network ("neural network", NEFIAN, paragraph [0014]) to 
detect the face of the speaker ("speaker's face in a video sequence", NEFIAN, 
paragraph [0014]); and 

processing a vector matching algorithm to detect movements of the mouth of the 
speaker within the detected face (see NEFIAN, paragraphs [0018]-[0020], a series of 
vector calculations are performed on the pixels representing the mouth regions). 

20. Regarding claim 24, KATSUMI further teaches that the instructions for 
associating further include instructions for matching same time slices associated with a 
time that the portions of the audio were captured and the same time during which the 
selective ones of the visual features were captured within the video ("when the 
'speaking attendee determination information based on audio signal' represents a voice 
and the 'speaking attendee determination information based on video signal' represents 
a change of the shape of the lip portion simultaneously", KATSUMI, column 6, lines 58- 
63). 

21. Regarding claim 25, KATSUMI teaches an apparatus, residing in a 
computer-accessible medium, comprising: 

face detection logic ("detects at least the lip portion of a conference attendee 
from the video signal", KATSUMI, column 6, lines 23-34); 

mouth detection logic ("extracts the change amount of the shape of the lip 
portion", KATSUMI, column 6, lines 24-25); and 

audio-video matching logic, wherein the face detection logic detects a face of a 
speaker within a video ("detects at least the lip portion of a conference attendee", 
KATSUMI, column 6, lines 23-34), the mouth detection logic detects and monitors 
movement and non-movement of a mouth included within the face of the video 
("extracts the change amount of the shape of the lip portion", KATSUMI, column 6, lines 
24-25), and the audio-video matching logic matches captured audio with any 
movements identified by the mouth detection logic ("when the 'speaking attendee 
determination information based on audio signal' represents a voice and the 'speaking 
attendee determination information based on video signal' represents a change of the 
shape of the lip portion simultaneously", KATSUMI, column 6, lines 58-63) 

wherein the video and audio are initially captured together separated for analysis 
("speaking attendee determination information based on audio signal", KATSUMI, 
column 6, lines 50-51; col. 6, lines 57-67; col. 4, lines 34-46; Figure 3; See Response to 
Arguments). 

However, KATSUMI does not disclose: specific frequencies occurring within 
captured audio during a training session and for a same time slice of that training 
session, and wherein the mouth is detected as moving by changes in pixels that 
represent the mouth within the face that occur from frame to frame of the video. 

In the same field of audiovisual processing, NEFIAN teaches: 

specific frequencies occurring within captured audio ("13 MFCC coefficients 
extracted from a window of 20 ms", NEFIAN, paragraph [0055]) during a training 
session ("training network and speech recognition module 18", NEFIAN, paragraph 
[0012]) and for a same time slice of that training session ("discrete nodes at time t for 
each HMM are conditioned by the discrete nodes at time t-1 of all the related HMMs", 
NEFIAN, paragraph [0023]), and wherein the mouth is detected as moving by changes 
in pixels that represent the mouth within the face that occur from frame to frame of the 
video (see NEFIAN, paragraphs [0018]-[0020], a series of vector calculations are 
performed on the pixels representing the mouth regions). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 

Katsumi, in view of Nefian, fails to teach where matching specific frequencies 
occurring within captured audio is to determine when the speaker is speaking and when 
the speaker is not speaking. 

Lubiarz teaches where matching specific frequencies occurring within captured 
audio is to determine when the speaker is speaking and when the speaker is not 
speaking ("speech recognition", col. 2, lines 28-37; "detects silence... detects the 
presence of voice activity... optimized... for each of the frequency bands", col. 5, lines 
17-34; exceeding a threshold indicates speech in a frequency band, and performing the 
appropriate processing as per Katsumi and Nefian includes matching the speech 
portion with the appropriate video). 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian, to include the teaching of Lubiarz 
of where matching specific frequencies occurring within captured audio is to determine 
when the speaker is speaking and when the speaker is not speaking, in order to ensure 
that the most effective processing for a particular type of signal is applied to that type of 
signal, as described by Lubiarz (col. 1, lines 5-10). 

Katsumi, in view of Nefian and Lubiarz, fails to teach wherein the visual features 
and audio are initially captured at a different rate from one another and based on time 
stamps. 

Veltman suggests wherein the visual features and audio are initially captured at a 
different rate from one another and based on time stamps ("sampling rate clock of the 
audio signal and the frame rate clock of the video system operate independently... time 
stamp", col. 1, lines 42-54; "sampling rate control circuit... for use in decoding the video 
stream and the audio stream respectively... video decoder removes each access unit... 
sampling rate controller... controlled by... time stamps", col. 5, lines 30-61; where 
independent clocks suggests different rates and the decoding is a separating of 
demultiplexed data based on time stamps). 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of invention to modify Katsumi, in view of Nefian and Lubiarz, to include the 
teaching of Veltman of wherein the visual features and audio are initially captured at a 
different rate from one another and based on time stamps in order to ensure proper 
decoding of an encoded data, as described by Veltman (col. 1, lines 55-60). 

22. Regarding claim 26, NEFIAN further teaches that the apparatus is used to 
configure a Bayesian network which models the speaker speaking ("video data must be 
fused with audio data using... a coupled hidden Markov model [HMM]", NEFIAN, 
paragraph [0023], where the HMM is a dynamic Bayesian network). 

23. Regarding claim 27, NEFIAN further teaches that the face detection logic 
comprises a neural network ("neural network", NEFIAN, paragraph [0014]). 

24. Claims 7 and 12 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Katsumi Patent No.: US 6,369,846 ("KATSUMI") in view of Nefian Pub. No.: US 
2003/0212557 ("NEFIAN"), Lubiarz, and Veltman, as applied to Claims 1 and 8, above, 
and further in view of Van Schyndel Patent No.: US 5,940,118 ("VAN SCHYNDEL"). 

25. Regarding claim 7, the combination of KATSUMI, NEFIAN, and Lubiarz teaches all 
the limitations of claim 1. 

However, KATSUMI, Nefian, and Lubiarz do not specifically disclose suspending 
the capturing of audio during periods where select ones of the captured visual features 
indicate that the speaker is not speaking. 

In the same field of audiovisual processing, VAN SCHYNDEL teaches 
suspending the capturing of audio during periods where select ones of the captured 
visual features indicate that the speaker is not speaking ("uses optical information to 
optimally select and/or steer a microphone array in the direction of the talker", VAN 
SCHYNDEL, column 2, lines 55-58, meaning audio is not captured for someone who is 
not speaking). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use selectable microphones of VAN SCHYNDEL 
with the speaker determination system of KATSUMI and audiovisual matching method 
of NEFIAN, and Lubiarz, in order to not restrict a talker's movement or position (VAN 
SCHYNDEL, column 2, lines 60-61). 

26. Regarding claim 12, the combination of KATSUMI, NEFIAN, and Lubiarz teaches 
all the limitations of claim 8. 

However KATSUMI, NEFIAN, and Lubiarz, do not specifically disclose 
suspending the capturing of audio when the analysis does not detect the mouths 
moving for the first and second speakers. 

In the same field of audiovisual processing, VAN SCHYNDEL teaches 
suspending the capturing of audio when the analysis does not detect the mouths 
moving for the first and second speakers ("uses optical information to optimally select 
and/or steer a microphone array in the direction of the talker", VAN SCHYNDEL, column 
2, lines 55-58, meaning audio is not captured for someone who is not speaking). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use selectable microphones of VAN SCHYNDEL 
with the speaker determination system of KATSUMI and audiovisual matching method 
of NEFIAN, and Lubiarz, in order to not restrict a talker's movement or position (VAN 
SCHYNDEL, column 2, lines 60-61). 

Conclusion 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to ERIC YEN whose telephone number is (571)272-4249. 
The examiner can normally be reached on M-F 7:30-4:00. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Richemond Dorvil can be reached on 571-272-7602. The fax phone 
number for the organization where this application or proceeding is assigned is 
571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 

EY 4/7/09 

/Richemond Dorvil/ 

Supervisory Patent Examiner, Art Unit 2626 



